Article History

Received: 14 August 2023
Accepted: 28 September 2023
Published: 30 September 2023

Keywords: PDF comparison; PDF Page labelling; Image change tracking

Abstract

Figuring out how a document has changed from one version to another isn't
always the simplest task. We encountered the problem of comparing two PDF
documents, edited using different editing tools. When we tried to compare
these PDFs using existing comparison tools, comparison results were not
satisfactory. After analysis, we found that, if documents had been edited
using any tool other than Acrobat (non-native), then these tools were unable
to detect the proper layout (paragraphs, headers, footers, columns, tables,
etc.) of the document and were therefore unable to sequence them in correct
order, resulting in false comparison output. To overcome this problem, we
tried the latest developments in computer vision to detect the layout
information of the document. Using layout information, contents were arranged
in correct order and then compared. This resulted in better comparison output.
Also, using AI for layout detection made it independent of how the document
was created and edited. We built a complete framework which includes reading
the information, detecting layout, arranging information, comparing it, and
visualizing the differences. This framework can be applied to build any
document comparison tool irrespective of document type.

1. Introduction


Documents are created to preserve content in the easiest manner. One of the
greatest advances of the digital era is the ability to store vast amounts of
data electronically. (Smock, Pesala, and Abraham) The volume of electronic
data is far larger than that of paper documents. Compared with physical
documents, electronic documents offer lower cost, less storage space,
portability, no physical damage, and a standard structure, which in turn
results in easy access. They also provide efficient ways to store and retrieve
information. PDF is one of the predominantly used document formats to store
textual content in organizations. (Kardas et al.) It provides multiple
functionalities to users, such as search, indexing, and images. In addition,
there is a facility to edit and add text or images inside an existing PDF. A
PDF can contain information in formats like text, table, and image data. Due
to the wide acceptance of PDFs and the ability to edit them, we often face the
challenge of finding the differences between the original and edited versions
of a PDF. There are many tools available in the market which compute these
differences and highlight them. (Gemelli, Vivoli, and Marinai) But we found
that these tools work well when the source of the PDF document is Acrobat;
when PDFs are created or edited with other tools, these comparison utilities
are unable to align the content in proper order, and hence their comparison
does not show correct differences. The comparison becomes even less accurate
if the PDFs are written in two-column format. To overcome this challenge, we
propose to use an advanced ML model to detect the PDF layout (Bimbo et al.)
and then arrange the information using this layout information. This machine
learning model is trained on publicly available PDF documents. It takes a PDF
page as input in image format and provides the bounding boxes for each column
of text in the PDF. Our emphasis is to increase PDF comparison accuracy with
advances in AI and current technology.


2. Related work 


There are multiple tools available online for PDF comparison, either as paid
services or as open source.


2.1. Paid Comparison tools 


Some of the paid comparison tools are Beyond Compare, Foxit Compare (Xiong
and Foxit), and Adobe Compare. They work well on standard PDFs. But we found
that if PDFs are edited in a non-standard way, or are generated from sources
other than Acrobat, these tools are not able to align the information from
the PDF, which leads them to highlight incorrect differences.


2.2. Open-Source Tools 


There are very few open-source compare utilities whose comparison results are
good enough to use. One of them is pdf-diff (Wu et al.), a utility created by
JoshData on GitHub. It is written in Python and works well if the PDFs are of
standard form and in one-column format, but it does not work on non-standard
and modified PDFs. It also does not perform image comparison, and it shows
differences in text only: space or newline characters added to or removed
from the PDF are not reported.

Some of the other available tools are Diff compare, Draftable, Pdfforge, and
Kiwi pdf compare. (Jarvis)

Diff compare highlights differences in a side-by-side view but is less
accurate than the other tools.

Draftable compares PDFs in a side-by-side view and spots differences based on
style and content. It works on text and images but fails when PDFs are edited
using any tool other than Acrobat.

Pdfforge is a platform to edit, create, convert, and organize PDFs. It
compares two PDFs based on text changes and offers a side-by-side or inline
view. It does not work for images.

Kiwi pdf compare supports text and image comparison. Images are compared
pixel by pixel. It also works on PDFs from other sources. However, the free
version is limited to comparing up to 100 pages. (Islam, Dias, and
Sunda-Meya)


[Figure: process flow. Source/Target PDF → read text from page along with
coordinates, and convert page to image → predict bounding box of text columns
using ML model → align text using text and bounding box coordinates → perform
text/image compare between source and target PDF → highlight text/image
differences in both source and target PDF]


FIGURE 1. Process flow diagram 




Maragathamani Subramaniam et al.


[Table: feature comparison of different PDF compare tools. Rows include
Beyond Compare and Adobe Compare; columns cover text compare, image
comparison, one-columnar and two-columnar layouts, and PDF source.]


2023, Vol. 05, Issue 08 September 


FIGURE 2. Feature comparison of different PDF compare tools. 


3. Proposed approach 


In this section we present the complete framework created for comparing
documents. The framework has been created for comparing PDF documents but can
be applied to other document types as well. It uses state-of-the-art computer
vision ML capability to align text in PDFs and then performs the comparison.
It also captures images from the PDFs and compares their position and content
to verify that they have not been changed.

Once the source and target PDFs have been supplied, the framework processes
them page by page and fetches text and images from both PDFs along with their
respective coordinates on the page. It also converts the pages to image
format and sends them to the ML model to get bounding boxes for the text
columns. Once text coordinates and bounding box coordinates are available,
the text is aligned according to its position. If text fragments overlap
row-wise within the same column, they are treated as belonging to the same
line. After the text has been arranged for the same page in both PDFs, we use
a text compare utility to find the differences in text, including text inside
tables. We then compare the images available on both pages for position and
content, and any image that has changed is marked as different. Once
differences are found, we use the stored text coordinate data to find the
coordinates of the differences and then use the PyMuPDF (Tkachenko et al.)
highlight functionality to highlight these differences in PDF format. If
output is requested in image format, pages are converted to images and
differences are highlighted using PIL. Figure 1 shows the process flow of the
tasks performed by this utility.

The framework performs the PDF comparison in five steps:

1. Extract text and images from both PDFs.

2. Train a model to detect layout information.

3. Align text according to the layout information.

4. Compare text and image differences.

5. Highlight differences.

A. Extract text and Images from PDFs 

We use existing Python libraries to fetch text and images from the PDF files.
Text information is captured at character level along with its position on
the page. Position is captured as coordinates measured from the top-left
corner of the page. Newline and space character information is also captured.
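As an illustration of character-level capture, the sketch below flattens a page dictionary into one record per character. It assumes the nested blocks/lines/spans/chars layout that PyMuPDF returns from `get_text("rawdict")`; the paper does not say which library performs the extraction, so this choice, the helper name `iter_chars`, and the sample page are all hypothetical.

```python
from typing import Dict, Iterator, Union

CharRecord = Dict[str, Union[str, float]]


def iter_chars(raw_page: dict) -> Iterator[CharRecord]:
    """Yield one record per character, including spaces, with its box.

    Coordinates are measured from the top-left corner of the page.
    """
    for block in raw_page.get("blocks", []):
        if block.get("type") != 0:  # 0 marks a text block; image blocks are skipped
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for ch in span.get("chars", []):
                    x0, y0, x1, y1 = ch["bbox"]
                    yield {"c": ch["c"], "x0": x0, "y0": y0, "x1": x1, "y1": y1}


# Hand-written stand-in for one page. With PyMuPDF installed, a real page
# dictionary could be obtained with (file name is a placeholder):
#   fitz.open("source.pdf")[0].get_text("rawdict")
sample_page = {
    "blocks": [
        {"type": 1, "bbox": (0, 0, 50, 50)},  # image block, ignored here
        {"type": 0, "lines": [{"spans": [{"chars": [
            {"c": "A", "bbox": (10.0, 20.0, 16.0, 32.0)},
            {"c": " ", "bbox": (16.0, 20.0, 19.0, 32.0)},
            {"c": "B", "bbox": (19.0, 20.0, 25.0, 32.0)},
        ]}]}]},
    ]
}

chars = list(iter_chars(sample_page))
print("".join(rec["c"] for rec in chars))
```

Keeping the space character as its own record matters later, because whitespace changes are part of what the comparison is meant to report.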


3.1. Train Model to detect Layout information 


Once character information is captured, we need to realign the text if it is
not sequentially ordered. To align it, we need to know whether the PDF page
is in one-column or two-column format. If the page is in two-column format,
we also need the column coordinates, so that while aligning we can divide the
text into columns and then order it within each column to get properly
sequenced data. To capture the PDF text layout, we trained a Detectron2 (Wu
et al.) model which takes a PDF page converted to a JPEG image as input and
returns bounding box information for the text columns of a two-column PDF.
Below are the steps performed to train the Detectron2 model for detecting PDF
layouts:


1) Data Collection: We downloaded publicly available two-column PDF
documents covering a variety of two-column page layouts. These pages were
converted to image format.


2) Data Labeling: We used Label Studio (Tkachenko et al.) to draw bounding
boxes around the text columns of two-column pages. The labeled information is
exported in Detectron2 input format.


3) Model Training: We used the Detectron2 model from Facebook and applied
transfer learning to train it to produce bounding boxes around text columns
in two-column PDFs.


4) Bounding Box Prediction: Page images are sent to our fine-tuned model,
which returns bounding box information for each column.
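To make the labeling step more concrete, the following sketch converts rectangle annotations into pixel boxes of the kind a detection dataset record needs. It assumes that the labeling export stores each rectangle as percentages of the image size together with `original_width` and `original_height`, which is our understanding of Label Studio's JSON export and should be checked against an actual export. The function names, the file name, and the record layout are illustrative and are not taken from the paper.

```python
from typing import Dict, List


def rectangle_to_pixels(result: Dict) -> List[float]:
    """Convert one percent-based rectangle to [x0, y0, x1, y1] in pixels."""
    width, height = result["original_width"], result["original_height"]
    value = result["value"]
    x0 = value["x"] * width / 100.0
    y0 = value["y"] * height / 100.0
    x1 = x0 + value["width"] * width / 100.0
    y1 = y0 + value["height"] * height / 100.0
    return [round(x0, 2), round(y0, 2), round(x1, 2), round(y1, 2)]


def to_dataset_record(file_name: str, image_id: int, results: List[Dict]) -> Dict:
    """Build one per-image record; expects at least one labeled rectangle."""
    return {
        "file_name": file_name,
        "image_id": image_id,
        "width": results[0]["original_width"],
        "height": results[0]["original_height"],
        # A single class (text column) is assumed, hence category_id 0.
        # A Detectron2 dataset would additionally need a box-mode field
        # telling it that the boxes are absolute x0, y0, x1, y1 values.
        "annotations": [
            {"bbox": rectangle_to_pixels(r), "category_id": 0} for r in results
        ],
    }


# Two labeled columns on a hypothetical 1000 x 1400 pixel page image.
labels = [
    {"original_width": 1000, "original_height": 1400,
     "value": {"x": 5, "y": 10, "width": 42, "height": 80}},
    {"original_width": 1000, "original_height": 1400,
     "value": {"x": 53, "y": 10, "width": 42, "height": 80}},
]
record = to_dataset_record("page_001.jpg", 1, labels)
print(record["annotations"])
```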


3.2. Align Texts as per layout information 


Once layout information is available, text is ordered according to the
bounding box coordinates. In addition, text fragments that overlap with an
overlap ratio of 60% or greater are placed in the same row.
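A minimal sketch of this alignment step is given below. It makes several assumptions that the paper does not spell out: the overlap ratio is taken as the vertical intersection divided by the height of the shorter fragment, a fragment belongs to the column that contains its centre, and fragments outside every column are emitted first. Words are used instead of single characters to keep the example readable, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence

Box = Sequence[float]  # (x0, y0, x1, y1), origin at the top-left of the page


def vertical_overlap_ratio(a: Box, b: Box) -> float:
    """Vertical intersection divided by the height of the shorter box."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(inter, 0.0) / shorter if shorter > 0 else 0.0


def column_index(box: Box, columns: List[Box]) -> int:
    """Index of the column containing the box centre, or -1 if none does."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    for i, col in enumerate(columns):
        if col[0] <= cx <= col[2] and col[1] <= cy <= col[3]:
            return i
    return -1


def group_rows(items: List[Dict], threshold: float = 0.6) -> List[List[Dict]]:
    """Group fragments into rows when they overlap vertically by >= threshold."""
    rows: List[List[Dict]] = []
    for item in sorted(items, key=lambda it: (it["box"][1], it["box"][0])):
        if rows and vertical_overlap_ratio(rows[-1][0]["box"], item["box"]) >= threshold:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [sorted(row, key=lambda it: it["box"][0]) for row in rows]


def reading_order(items: List[Dict], columns: List[Box], threshold: float = 0.6) -> List[str]:
    """Return text lines column by column, left column first."""
    columns = sorted(columns, key=lambda c: c[0])
    buckets: Dict[int, List[Dict]] = {i: [] for i in range(-1, len(columns))}
    for item in items:
        buckets[column_index(item["box"], columns)].append(item)
    lines = []
    for i in range(-1, len(columns)):  # -1 holds fragments outside all columns
        for row in group_rows(buckets[i], threshold):
            lines.append(" ".join(it["text"] for it in row))
    return lines


columns = [(50, 100, 290, 700), (310, 100, 550, 700)]
items = [
    {"text": "Left", "box": (60, 110, 100, 122)},
    {"text": "one", "box": (105, 111, 130, 123)},
    {"text": "Right", "box": (320, 110, 370, 122)},
    {"text": "one", "box": (375, 110, 400, 122)},
    {"text": "Left", "box": (60, 130, 100, 142)},
    {"text": "two", "box": (105, 130, 130, 142)},
    {"text": "Right", "box": (320, 130, 370, 142)},
    {"text": "two", "box": (375, 130, 400, 142)},
]
print(reading_order(items, columns))
```

The input list is deliberately given in page order (row by row across both columns); the function returns the left column in full before the right column.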


3.3. Compare text and image differences 


Once text is aligned, we use the Python library diff-match-patch from Google
(Cross et al. Marinai) to compare the text from the pages. Images and their
positions are also compared, and if any difference arises in either image
position or image content, the image is highlighted.
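The comparison step can be sketched as follows. To keep the example free of third-party dependencies it uses the standard-library `difflib` as a stand-in for diff-match-patch, and it compares image content by hashing the raw bytes rather than by any pixel-level method. Both are substitutions made for illustration; the function names, the position tolerance, and the image record layout are assumptions.

```python
import difflib
import hashlib
from typing import Dict, List, Tuple

Diff = Tuple[str, Tuple[int, int], Tuple[int, int]]


def text_differences(source: str, target: str) -> List[Diff]:
    """Return (operation, source span, target span) for every changed region.

    The comparison is character based, so added or removed spaces and
    newlines are reported like any other change.
    """
    matcher = difflib.SequenceMatcher(None, source, target, autojunk=False)
    return [
        (tag, (i1, i2), (j1, j2))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def image_changed(src_img: Dict, tgt_img: Dict, tolerance: float = 0.5) -> bool:
    """True if the image moved by more than `tolerance` or its bytes differ."""
    moved = any(abs(a - b) > tolerance for a, b in zip(src_img["box"], tgt_img["box"]))
    src_digest = hashlib.sha256(src_img["data"]).digest()
    tgt_digest = hashlib.sha256(tgt_img["data"]).digest()
    return moved or src_digest != tgt_digest


source_text = "any of your employees."
target_text = "any of your  employees.\n"
for operation, src_span, tgt_span in text_differences(source_text, target_text):
    print(operation, src_span, tgt_span)

logo = {"box": (10, 10, 60, 60), "data": b"image-bytes-v1"}
moved_logo = {"box": (10, 40, 60, 90), "data": b"image-bytes-v1"}
print(image_changed(logo, moved_logo))
```

The spans index into the aligned text, so they can be mapped back to the stored character coordinates when the differences are highlighted.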


3.4. Highlight differences 


Once text and image differences are available, we highlight them in an easily
comprehensible format. The output of the PDF compare can be seen in Figure 4.
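One way to turn a text difference into highlight regions is to merge the boxes of the affected characters into one rectangle per text row, as sketched below. The row tolerance and the function name are assumptions made for illustration, and the drawing itself is left to the PDF or image library.

```python
from typing import Dict, List, Sequence


def span_to_rects(chars: Sequence[Dict], start: int, end: int,
                  row_tol: float = 1.0) -> List[List[float]]:
    """Merge the boxes of chars[start:end] into one rectangle per text row."""
    rects: List[List[float]] = []
    for ch in chars[start:end]:
        x0, y0, x1, y1 = ch["box"]
        last = rects[-1] if rects else None
        same_row = (
            last is not None
            and abs(last[1] - y0) <= row_tol
            and abs(last[3] - y1) <= row_tol
        )
        if same_row:
            # Extend the current rectangle to cover this character as well.
            last[0], last[2] = min(last[0], x0), max(last[2], x1)
            last[1], last[3] = min(last[1], y0), max(last[3], y1)
        else:
            rects.append([x0, y0, x1, y1])
    return rects


# Three characters: "a" and "b" on one row, "c" on the next row.
page_chars = [
    {"c": "a", "box": (0, 10, 6, 20)},
    {"c": "b", "box": (6, 10, 12, 20)},
    {"c": "c", "box": (0, 30, 6, 40)},
]
highlight_rects = span_to_rects(page_chars, 0, 3)
print(highlight_rects)

# Each rectangle could then be handed to the output library, for example
# PyMuPDF's page.add_highlight_annot(...) for PDF output, or Pillow's
# ImageDraw.Draw(image).rectangle(...) for image output after scaling the
# coordinates to the rendered page size.
```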


4. Experimental results


4.1. Use rule-based system to compare PDF.


First, a simple rule-based model was used to arrange the text in a PDF
document. Text is arranged from top to bottom and then from left to right;
this method is used by many PDF readers to order text. It performs well on
PDFs having only one column of text. But for two-column PDFs, finding the
column separation and using it to arrange the text was not possible with
rules alone, since pages can also contain images, tables, etc. So we used AI
bounding box prediction to find the column bounding boxes.
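The limitation of the rule-based ordering can be seen on a toy two-column page. The sketch below is only an illustration of the top-to-bottom, left-to-right rule described above; the function name and the coordinates are made up.

```python
from typing import Dict, List


def rule_based_order(fragments: List[Dict]) -> List[str]:
    """Order fragments top to bottom, then left to right, ignoring columns."""
    ordered = sorted(fragments, key=lambda f: (f["box"][1], f["box"][0]))
    return [f["text"] for f in ordered]


# Two lines in the left column (L1, L2) and two in the right column (R1, R2).
two_column_page = [
    {"text": "L1", "box": (60, 110, 280, 122)},
    {"text": "L2", "box": (60, 130, 280, 142)},
    {"text": "R1", "box": (320, 110, 540, 122)},
    {"text": "R2", "box": (320, 130, 540, 142)},
]
print(rule_based_order(two_column_page))  # ['L1', 'R1', 'L2', 'R2']
```

The lines of the two columns are interleaved, whereas the intended reading order is L1, L2, R1, R2. This is the ordering error that the column bounding boxes are meant to remove.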

We also tried using OpenCV to identify the coordinates of columns, but
different PDFs organize their content in different ways. In some PDFs, tables
span both columns and take the full width of the page. It is difficult to
capture column coordinates using OpenCV in such scenarios. We found many
other formats where capturing layout details using OpenCV was difficult. So
we decided to train an ML model for it.


4.2. Build model to predict bounding box and 
compare. 


• Model training with masked data: We tried masking the text data using
OpenCV and training a Detectron2 model to detect text column bounding boxes,
but it did not perform well in terms of accuracy.


• Model training without masking data: We labeled unmasked PDF pages using
Label Studio and then trained a Detectron2 model on them. This model gave
better results in detecting bounding boxes for text columns.


5. Conclusion and future work 


In this paper we have presented a framework that uses AI to build a document
compare utility for PDFs. It finds differences in text, images, tables,
numbers, or symbols in PDFs. Comparison is done quickly and with greater
accuracy. It is also independent of how the document was edited. We can also
use this framework to create compare utilities for other document types. This
will help people in multiple domains where PDF comparison is required. This
tool can be useful for document review in organizations, and it can present
the differences in an easily understandable format between different versions
FIGURE 3. Column box labelling of PDF page


[Figure: sample two-column insurance policy page, with sections including
"1. Employee Theft Coverage" and "2. Forgery Or Alteration Coverage"]


FIGURE 4. Bounding box prediction for page with images 


of the same document. For academics, this serves as a useful tool to get the
relevant information at hand without thoroughly reading through the document.
In the review process, this will also enhance the accuracy of comparison.
This tool also overcomes the limitations of the open-source tools on the
market, such as missing image comparison and table comparison, and it handles
two-column PDFs created from different sources.


We can train and replace the model to deal with more complex PDF files. Also,
this framework can be extended to other document types like docx, pptx, etc.